-
Notifications
You must be signed in to change notification settings - Fork 375
Add loadEntities batch call and rename listFullEntities #2508
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add loadEntities batch call and rename listFullEntities #2508
Conversation
...bc/src/main/java/org/apache/polaris/persistence/relational/jdbc/JdbcBasePersistenceImpl.java
Show resolved
Hide resolved
polaris-core/src/main/java/org/apache/polaris/core/entity/PolarisBaseEntity.java
Show resolved
Hide resolved
...-core/src/main/java/org/apache/polaris/core/persistence/AtomicOperationMetaStoreManager.java
Outdated
Show resolved
Hide resolved
polaris-core/src/main/java/org/apache/polaris/core/persistence/PolarisMetaStoreManager.java
Outdated
Show resolved
Hide resolved
polaris-core/src/main/java/org/apache/polaris/core/persistence/PolarisMetaStoreManager.java
Outdated
Show resolved
Hide resolved
1e05a56 to
05d7f49
Compare
| * NULL if the entity has been dropped. | ||
| */ | ||
| @Nonnull | ||
| ResolvedEntitiesResult loadResolvedEntities( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The existence of this method may cause confusion since the single-lookup methods have both ByName and ById variations, and an EntityNameLookupRecord happens to have the name in it, and yet it looks like the actual key difference between the two methods is whether we have EntityType per item.
I'm not sure if we already use EntityNameLookupRecord as an input argument anywhere else, but generally since it was kind of structured as an output argument before it seems it becomes ambiguous as an input argument.
And at first glance the unittests seems to imply that the lookup would be "by name":
@ParameterizedTest
@ValueSource(strings = {"id", "name"})
....
if (loadType.equals("id")) {
// Create entity ID list with the updated entity
List<PolarisEntityId> entityIds = List.of(getPolarisEntityId(T6v2));
// Call batch load - this should detect the stale version and reload
results =
cache.getOrLoadResolvedEntities(this.callCtx, PolarisEntityType.TABLE_LIKE, entityIds);
} else {
results =
cache.getOrLoadResolvedEntities(this.callCtx, List.of(new EntityNameLookupRecord(T6v2)));
}
To match convention with the single-item lookups can we rename these methods to say loadResolvedEntitiesById?
And if the difference is really just whether we pass in a per-entity EntityType, I think even parallel Lists (List<EntityId>, List<EntityType> would be better than reusing EntityNameLookupRecord just for its catalogId, entityId, entityType.
Alternatively, cleanest would be just having one interface entirely with List<EntityIdAndType> as the input argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thing to consider is whether we actually want to allow different entity type lookups within a single batch. It may change the atomicity semantics for persistence implementations where different types are in different atomicity domains.
Are there any callsites that actually rely on using this form of the method instead of the one with a single entityType across the whole list of EntityIds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH, I can't think of a use case where we would mix EntityTypes in a single call. The two immediate use cases I have in mind are
- Batch loading the principal roles during the authentication step
- Support for loading TableMetadata from persistence rather than from cloud storage (this is in concert with Add properties from TableMetadata into Table entity internalProperties #2735 and other future PRs).
In both cases, only a single EntityType is loaded. I used the EntityNameLookupRecord type largely because it is the return type for the listEntities API, but I wanted to avoid fetching all entities in full in the case that many/most entities are already in cache. Personally, I don't like the pattern of using parallel list parameters for an API, so I would oppose the List<EntityId>, List<EntityType> option. I am ok with a new EntityIdAndType argument, but I'd also be ok with just supporting the one API that takes in the EntityType and List<EntityId> arguments and getting rid of the other option until a need arises.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah sounds good, I think it's good to have the more opinionated "all entities in the batch are the same EntityType" method signature for now, as it's easier to ensure different impls can fulfill the interface. Signature (EntityType, List<EntityId>) looks good to me, and let's remove the overloaded method regarding EntityNameLookupRecord.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not have any concerns with the current state of this PR, but I'd be interested in reviewing again after comments from @dennishuo are resolved :)
| if (e == null) { | ||
| return null; | ||
| } else { | ||
| // load the grant records |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe add toResolvedPolarisEntity() as in AtomicOperationMetaStoreManager?
6804058 to
2289355
Compare
| * NULL if the entity has been dropped. | ||
| */ | ||
| @Nonnull | ||
| ResolvedEntitiesResult loadResolvedEntities( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name LGTM, but I guess it does not match the PR description anymore 🤔
dimas-b
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes LGTM 👍 but I'll defer approval to @dennishuo .
dennishuo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
2289355 to
beabf6b
Compare
* Add loadEntities batch call and rename listFullEntities (apache#2508) * Add loadEntities batch call and rename listFullEntities * Changed batch call to implement loadResolvedEntities instead * Add loadResolvedEntities by id and entity cache support * Add additional test for loadResolvedEntities by id * Added additional test and updated comments in EntityCache interface * Add additional constructor to ResolvedEntitiesResult * Fixed unused method reference * Removed loadResolvedEntities method with lookup record param * Pulled out toResolvedPolarisEntity method per PR comment * Core: made the ARN role regex more generic (apache#3005) * fix(docs): Generify S3 index page (apache#2997) * Remove the mention of "cloud" since not all possible storage options are provided in "cloud". * Avoid listing specific child pages in the doc test. Rely on Hugo-general index (on the left-hand pane). --------- Co-authored-by: Alexandre Dutra <[email protected]> * fix(deps): update dependency io.prometheus:prometheus-metrics-exporter-servlet-jakarta to v1.4.3 (apache#3009) * fix(deps): update dependency com.google.cloud:google-cloud-storage-bom to v2.60.0 (apache#3011) * fix(deps): update dependency io.opentelemetry:opentelemetry-bom to v1.56.0 (apache#3012) * fix(deps): update dependency com.adobe.testing:s3mock-testcontainers to v4.10.0 (apache#3010) * fix(deps): update dependency org.agrona:agrona to v2.3.2 (apache#3014) * fix(deps): update quarkus platform and group to v3.29.2 (apache#3013) * chore(deps): update dependency pre-commit to v4.4.0 (apache#3015) * fix(deps): update dependency software.amazon.awssdk:bom to v2.38.2 (apache#3019) * Add test for TracingFilter (apache#2847) * NoSQL: Add (micro-ish) benchmarks (apache#3006) A project for JMH based benchmarks against NoSQL persistence. * Helm chart: include configmap checksum in deployment annotations (apache#3023) * fix(deps): update dependency ch.qos.logback:logback-classic to v1.5.21 (apache#3025) * NoSQL: Realms handling (apache#3007) Introduces handling for realms including realm-state management/transition. The `RealmStore` implementation for NoSQL depends on CDI components, coming in a follo-up PR. * Rename AccessConfig and AccessConfigProvider for clarity (apache#2883) * rename AccessConfig for clarity * rename getStorageAccessConfig() and add javadoc * Refactor: improve and clean up Dockerfiles (apache#2957) * Refactor: improve and clean up Dockerfiles * Refactor: improve and clean up Dockerfiles * Refactor: improve and clean up Dockerfiles * Refactor: improve and clean up Dockerfiles * Refactor: improve and clean up Dockerfiles * Refactor: improve and clean up Dockerfiles * Make StorageAccessConfigProvider request-scoped (apache#2974) - add `StorageCredentialsVendor` as request-scoped wrapper around `PolarisCredentialVendor` - make `FileIOFactory` request-scoped - make `TaskFileIOSupplier` request-scoped * Increase javadoc visibility in `nosql/realms` (apache#3029) This is to fix javadoc error: `No public or protected classes found to document` * NoSQL: Add correctness tests (apache#3027) Verifies the correctness of concurrent commits, and big index handling. These tests are intentionally _not_ part of the base-backend test suite for two reasons: 1. These tests do not run against the `Backend` interface but the `Persistence` interface, including commit and index logic. 2. These tests are intended to be runnable against a custom provisioned database cluster, not just tiny-ish test containers. * NoSQL: Add maintenance API, SPI (apache#3028) Maintenance operations include a bunch of tasks that are regularly executed against a backend database. Types of maintenance operations include: * Purging unreferenced objects and references within a catalog * Purging whole catalogs that are marked to be purged * Purging whole realms that are marked to be purged Implementation added in a follow-up PR. * Embrace request-scoped TokenBroker (apache#3024) * Embrace request-scoped TokenBroker `TokenBroker` and `CallContext` are both request-scoped, so instead of passing the former into the latter, we can do this via the `TokenBrokerFactory` and thus simplify the `TokenBroker` interface. * fix(deps): update dependency io.smallrye:jandex to v3.5.2 (apache#3032) * Fix monkey patching (apache#3016) * chore(deps): update quay.io/keycloak/keycloak docker tag to v26.4.5 (apache#3034) * chore(deps): update registry.access.redhat.com/ubi9/openjdk-21-runtime docker tag to v1.23-6.1762870925 (apache#3053) * fix(deps): update dependency org.testcontainers:testcontainers-bom to v2.0.2 (apache#3054) * chore(deps): update postgres docker tag to v18.1 (apache#3055) * Add Polaris Community Meeting 2025-11-13 (apache#3060) * Site: Rename menu "downloads" to "releases" (apache#2928) * Update dependency software.amazon.awssdk:bom to v2.38.7 (apache#3065) * Test-fix: Cleanup OPA test container on stop (apache#3041) Quarkus takes care of reusing a test-resource across tests. The current behavior leaves the container around. Plus some nit-fixes (deprecation + local var) * Update dependency org.apache.commons:commons-lang3 to v3.20.0 (apache#3063) * Build: ensure LICENSE/NOTICE is in all jars, always add pom-files to all jars (apache#3057) There are a some inconsistencies between the different kinds of jars and the included information: * LICENSE/NOTICE files are present in the "main" jar and in the sources jar, but not in the javadoc jar. * The Maven pom.xml and pom.properties files are only present for release builds or when explicitly requested. * "Additional" jar-manifest attributes that are only present in release builds. This change fixes the three mentioned issues: * Always include pom.xml and pom.properties in the built jar files. * Always include the additional jar-manifest attributes, except the Git information, which would otherwise render the Gradle build cache ineffective. * Include pom.xml + pom.properties + license/notice in literally all jar files. The Gradle logic to include the license+notice+pom files has been simplified as well. * Remove unused polarisEventListener field from IcebergCatalogHandler (apache#3045) it was added in c3f5001 but then its only usage was removed in d03c717 * fix(deps): update quarkus platform and group to v3.29.3 (apache#3052) * Site: Add Open Policy Agent (OPA) as External Policy Decision Point (apache#3030) Doc PR following up the introduction of OpaPolarisAuthorizer: apache#2680 * OPA: Tackle deprecation warnings (apache#3042) Instead of suppressing the deprecations, this change updates the code a little bit to remove the mocks (except to create a non-nullable parameter). * Use POJOs for OPA JSON schema construction and publish schema (apache#3031) Co-authored-by: Robert Stupp <[email protected]> * Use CDI for more test setups (apache#3040) this avoids a bunch of redundant manual setup. the important parts are establishing a `RealmContext` by calling `QuarkusMock.installMockForType` and then populating `polarisContext` from the injected `CallContext`. * fix(deps): update dependency com.github.dasniko:testcontainers-keycloak to v4 (apache#3070) * chore(deps): update actions/checkout digest to 93cb6ef (apache#3068) * OPA: Fail fast when OPA bearer token file is unreadable (apache#3062) * fix(deps): update immutables to v2.11.7 (apache#3072) * Skip Hugo Site workflow on forks (apache#3056) Forks usually don't have the "versioned-docs" tag and thus PRs against forks or rebasing the main branch on a fork currently always causes workflow failures. * Fix warnings around TransactionWorkspaceMetaStoreManager (apache#3044) - dont return `null` for interface methods that are `@Nonnull` - fix wrong method name parameters - dont annotate void methods as `@Nonnull` * NoSQL: Add CDI/common+testing + necessary nosql-store implementations (apache#3035) Adds common and test-specific CDI functionality. Requires the NoSQL store implementations `:polaris-persistence-nosql-realms-store-nosql` and `:polaris-nodes-store-nosql`. Those modules have cross-project dependencies for test purposes, hence those are all contained in this PR. CDI for Quarkus will be added in a follow-up. * Automate the release guide - Take 2 - Github workflows (apache#2383) The release automation is simplified to four GitHub workflows that just require the really mandatory user input: the version number. 1. workflow: Trigger the creation of the release branch 2. workflow: Upgrade the release branch with the version and build the the final change-log for that version 3. workflow: Build the RC artifacts from the release branch and push those to the various staging repositories 4. workflow: Eventually release the artifacts. See also the [email announcement](https://lists.apache.org/thread/d0smz07gnr509yj5dc6omo3cvkf1pnh7). --------- Co-authored-by: Robert Stupp <[email protected]> * Update actions/checkout digest to 93cb6ef (apache#3082) * NoSQL: adapt to conflicting changes in main * Last merged commit 8ccddc5 --------- Co-authored-by: Michael Collado <[email protected]> Co-authored-by: cccs-cat001 <[email protected]> Co-authored-by: Dmitri Bourlatchkov <[email protected]> Co-authored-by: Alexandre Dutra <[email protected]> Co-authored-by: Mend Renovate <[email protected]> Co-authored-by: Nuoya Jiang <[email protected]> Co-authored-by: Yong Zheng <[email protected]> Co-authored-by: Christopher Lambert <[email protected]> Co-authored-by: JB Onofré <[email protected]> Co-authored-by: Yufei Gu <[email protected]> Co-authored-by: Sung Yun <[email protected]> Co-authored-by: Pierre Laporte <[email protected]>
#2290 introduced a new
loadEntitiesvariant, which is really alistEntitiescall that returns the completePolarisBaseEntityrather than theEntityNameLookupRecord. A batchloadEntitiescall that functions similar to theloadEntity, when given an id, returns the identified entity, is also useful, notably for cases when you don't want to list all entities of a particular type (e.g., loading a set of Principal Roles or refreshing specific entities for theEntityCache).This introduces a new
loadResolvedEntitiesAPI and renames the previousloadEntitiestolistFullEntitiesto avoid ambiguity. The new API now mirrors theloadResolvedEntity...APIs, accepting either a list ofPolarisEntityIds orEntityNameLookupRecords. I usedEntityNameLookupRecordbecause that is the return type for the originallistEntitiesAPI, but also becausePolarisEntityCorerequires agrantVersion, which may not be present, e.g., if the caller only has the results of alistEntitiescall. I also wanted to mirror the existingloadEntityAPI, which requires anPolarisEntityTypeargument and thePolarisEntityIddoesn't contain a type field.I used the
ResolvedPolarisEntitytype and terminology in the API name in order to make the EntityCache API and the raw PolarisMetaStoreManager API the same. In part, this aims to start bringing the two APIs closer together so that the concept of the cache can one day be just an implementation detail, rather than part of the core business logic. The bulk load implementation in the cache mirrors the logic in the Resolver, in that it ensures that it always returns a snapshot consistent with the state of the persistence layer as it exists at a single point in time. This means that it validates that the entire batch of entities returned matches the entity versions and grant versions returned by a call to theloadEntitiesChangeTrackingAPI.